Comparing k-means clusters on parallel Persian-English corpus

Authors

A. Khazaei

M. Ghasemzadeh

Abstract

This paper compares the clusters of aligned Persian and English texts produced by the k-means method. Text clustering has many applications across natural language processing, and much of the research on document clustering has been carried out on English. This raises a question: do those results extend to other languages? Because document clustering groups documents by their content, one might expect the answer to be yes. On the other hand, the many differences between languages could make the answer no. This research focuses on k-means, one of the basic and most popular document clustering methods, and asks whether the clusters it produces for aligned Persian and English texts are similar. To answer this question, the Mizan English-Persian parallel corpus was used as the benchmark. After feature extraction with text mining techniques and dimension reduction with principal component analysis (PCA), k-means clustering was performed. The morphological differences between English and Persian led to longer feature vectors for Persian, and in almost all experiments the English results were slightly richer than the Persian ones. Aside from these differences, the overall behavior of the Persian and English clusters was similar, which suggests that the results of k-means research on English can be extended to Persian. Finally, there is hope that, despite the many differences between languages, clustering methods may be extendable to other languages as well.
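The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D vectors stand in for PCA-reduced document features, the plain Lloyd's k-means with first-k seeding is a simplification, and the Rand index is an assumed agreement measure (the abstract does not say which measure was used). Because the corpus is aligned, document i on the English side corresponds to document i on the Persian side, so the two labelings can be compared pair by pair.

```python
from itertools import combinations


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two documents together,
    or both keep them apart. Cluster label names do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same aligned documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical reduced feature vectors; row i of each list is the same aligned document.
english_vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (0.1, 0.3), (4.8, 5.0)]
persian_vectors = [(9.0, 1.0), (8.8, 1.2), (1.0, 9.0), (1.2, 9.1), (1.3, 8.9), (1.1, 8.8)]

english_labels = kmeans(english_vectors, k=2)
persian_labels = kmeans(persian_vectors, k=2)
agreement = rand_index(english_labels, persian_labels)
print(english_labels, persian_labels, round(agreement, 3))
```

The toy data is built so that one aligned document lands in a different cluster on the Persian side, which pulls the agreement score below 1. A chance-corrected measure such as the adjusted Rand index would be a reasonable alternative on real data.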

Similar resources

MIZAN: A Large Persian-English Parallel Corpus

One of the most major and essential tasks in natural language processing is machine translation that is now highly dependent upon multilingual parallel corpora. Through this paper, we introduce the biggest Persian-English parallel corpus with more than one million sentence pairs collected from masterpieces of literature. We also present acquisition process and statistics of the corpus, and expe...

TEP: Tehran English-Persian Parallel Corpus

Parallel corpora are one of the key resources in natural language processing. In spite of their importance in many multi-lingual applications, no large-scale English-Persian corpus has been made available so far, given the difficulties in its creation and the intensive labors required. In this paper, the construction process of Tehran English-Persian parallel corpus (TEP) using movie subtitles,...

PEN: Parallel English-Persian News Corpus

Parallel corpora are the necessary resources in many multilingual natural language processing applications, including machine translation and cross-lingual information retrieval. Manual preparation of a large scale parallel corpus is a very time consuming and costly procedure. In this paper, the work towards building a sentence-level aligned English-Persian corpus in a semi-automated manner is p...

Extracting an English-Persian Parallel Corpus from Comparable Corpora

Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system. The more of these data are available, the better the quality of the SMT system. However, for some language pairs such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from English and Persian document aligned...

Creating a Persian-English Comparable Corpus

Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However the Persian language does not have rich multilingual resources due to some of its special features and difficulties in constructing the corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in Englis...

Journal title: Journal of AI and Data Mining

Publisher: Shahrood University of Technology

ISSN: 2322-5211

Volume 3, Issue 2, 2015

Keywords

clustering; Mizan English-Persian parallel corpus; k-means; principal component analysis (PCA)
